{
 "metadata": {
  "name": "",
  "signature": "sha256:cf65ac9a739727f06cd0a90ef529a89c4292be80bac4f6551b27b5c3627bdddd"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Introduction to Logistic Regression"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Scott Hendrickson  (2014-03-20)\n",
      "\n",
      "Classification: \n",
      "\n",
      "* Learn classification of vectors from labeled training data\n",
      "* Vectors of N dimensions\n",
      "* Labels: {0,1}\n",
      "* Generalize to predict class of new vectors"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import pandas as pd \n",
      "from ggplot import *\n",
      "%matplotlib inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Simple 1-D Example"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dat_1d = {\"x\"    :[2, 2, 1, 0.25, -0.4, -1, -2, -2, 1], \n",
      "       \"label\":[0, 0, 0,    0,    1,  1,  1,  1, 1]}\n",
      "df_1d = pd.DataFrame.from_dict(data = dat_1d)\n",
      "df_1d"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df_1d[\"label_fact\"] = pd.Categorical.from_array([str(x) for x in df_1d.label])\n",
      "ggplot(aes(x=\"x\",y=\"label\", color=\"label_fact\"), data=df_1d) + geom_point()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Sigmoid Function"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Properties of function of distance from boundary:\n",
      "\n",
      "* Continuous\n",
      "* Domain of all reals\n",
      "* Range [0,1]\n",
      "* Convex function (differentiable, $\\frac{d s(x)}{dx} > 0 \\text{ } \\forall \\text{ } x$\n",
      "\n",
      "$f(x) = \\frac{1}{1 + \\exp(-z)}$\n",
      "\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def sigmoid(x):\n",
      "    # vector in, vector out\n",
      "    return 1./(1. + np.exp(-x))\n",
      "\n",
      "# Let's see it\n",
      "d_sig = pd.DataFrame.from_dict({\"x\": np.linspace(-8,8,30), \n",
      "                                \"y\": sigmoid(np.linspace(-8,8,30))})\n",
      "ggplot(aes(x=\"x\", y=\"y\"), data = d_sig) + geom_line(color=\"orange\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Model"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Model predicts $l(x) = [0,1]$ based on sigmoid.  Calculate argument to sigmoid from input data, e.g.:\n",
      "\n",
      "For 1-D input vectors:\n",
      "\n",
      "   $z = \\beta_0 + \\beta_1 x$\n",
      "\n",
      "Or 2-D input vectors:\n",
      "\n",
      "   $z = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2$\n",
      "\n",
      "$\\ldots$\n",
      "\n",
      "Model prediction in 1-D:\n",
      "\n",
      "$y(x) = sigmoid(\\beta_0 + \\beta_1 x) = \\frac{1}{1 + \\exp(-(\\beta_0 + \\beta_1 x))}$\n",
      "\n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Cost Function"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "\n",
      "\n",
      "Need a cost function for which optimization will give the boundary N-dim plane defined by $[\\beta_0, \\beta_1, \\ldots ]$.\n",
      "\n",
      "###Least Squares?\n",
      "\n",
      "$Cost = \\frac{1}{m} \\sum{\\frac{1}{2}(l(x) - y(x))^2}$\n",
      "\n",
      "This is the cost if the model prediction is $y(x)$ and the actual outcome is $l(x)$. With the sigmoid, this funciton is not convex, so local minima are challenging.\n",
      "\n",
      "Why is this a problem?\n",
      "\n",
      "Lots of local minima mean gradient descent may not find the global optimum. We would like a convex function so if you run gradient descent and converge to the global minimum.\n",
      "\n",
      "###Logistic Regression Cost Function\n",
      "For model y(x):\n",
      "    \n",
      "$C = -[1-l(x)]\\ln[l(x)-y(x)] - l(x) \\ln[y(x)] $\n",
      "\n",
      "$C = -[\\text{cost for label 0}] - [\\text{cost for label 1}]$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def cost(label_arr, model_arr):\n",
      "    # inputs are nb.array([])\n",
      "    # sum over training set\n",
      "    c = -(1. - label_arr)*np.log(1. - model_arr) - label_arr*np.log(model_arr)\n",
      "    return np.sum(c)/len(c)\n",
      "print \"Model  ~ Label: {:.4}\".format(cost(np.array([0,0,0,1,1,1]), np.array([0.1,0.2,0.05,0.99,0.98,0.99])))\n",
      "print \"Model !~ Label: {:.4}\".format(cost(np.array([0,0,0,1,1,1]), np.array([0.99,0.98,0.99,0.1,0.2,0.05])))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def model_1d(model_vec, df):\n",
      "    # Calculate model output\n",
      "    z = model_vec[0] + model_vec[1] * df.x\n",
      "    return sigmoid(z)\n",
      "    \n",
      "def model_cost(model_vec, df, model):\n",
      "    # model_vec is python vector\n",
      "    # df is data frame with column x and column label\n",
      "    # model is a functionto calculate model output\n",
      "    return cost(df.label, model(model_vec, df))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from scipy.optimize import minimize as mini\n",
      "res_1d = mini(model_cost, x0=[0.1,0.1], args=(df_1d, model_1d)) \n",
      "print (res_1d)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df_pred_1d = pd.DataFrame.from_dict({\"x\": np.linspace(-3,3,99)})\n",
      "df_pred_1d[\"y\"] = model_1d(res_1d.x, df_pred_1d)\n",
      "df_pred_1d.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ggplot(aes(x=\"x\",y=\"label\", color=\"label_fact\"), data=df_1d) + \\\n",
      "  geom_point() + \\\n",
      "  geom_line(aes(x=\"x\", y=\"y\"), color=\"green\", data=df_pred_1d)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example - 2D"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dat_2d = {   \"x1\":[-1.0, 0.3, 2.05, 0.95,  2,-1.1,-2.4,-0.9,-2.1], \n",
      "             \"x2\":[ 2.0, 0.6,  2.1,  2.1, -1,-1  ,-1.8,  -2, 0.0], \n",
      "       \"label\":[0,0,0,0,0,1,1,1,1]}\n",
      "df_2d = pd.DataFrame.from_dict(data = dat_2d)\n",
      "df_2d[\"label_fact\"] = pd.Categorical.from_array([str(x) for x in df_2d.label])\n",
      "df_2d"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ggplot(aes(x=\"x1\",y=\"x2\", color=\"label_fact\"), data=df_2d) + geom_point()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def model_2d(model_vec, df):\n",
      "    z = model_vec[0] + model_vec[1] * df.x1 + model_vec[2] * df.x2\n",
      "    return sigmoid(z)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res_2d = mini(model_cost, x0=[1,1,0.5], args=(df_2d, model_2d)) \n",
      "print (res_2d)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# z = 0 defines a line\n",
      "x1 = np.linspace(-3,3,27)\n",
      "x2 = -(res_2d.x[0] + x1*res_2d.x[1])/res_2d.x[2]\n",
      "df_pred_2d = pd.DataFrame.from_dict({ \"x1\": x1, \n",
      "                                      \"x2\": x2  })\n",
      "df_pred_2d[\"y\"] = model_2d(res_2d.x, df_pred_2d)\n",
      "df_pred_2d.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ggplot(aes(x=\"x1\", y=\"x2\"),data=df_pred_2d) + geom_line() + geom_point(aes(x=\"x1\",y=\"x2\", color=\"label_fact\"), data=df_2d)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from scipy.stats import norm\n",
      "# Make some data\n",
      "npts = 65\n",
      "x1 = np.concatenate((norm.rvs(loc=2, scale=1.5,   size=npts),  norm.rvs(loc=-2, scale=1.5, size=npts)))\n",
      "x2 = np.concatenate((norm.rvs(loc=2, scale=1.5, size=npts),  norm.rvs(loc=-1, scale=1.5,   size=npts)))\n",
      "lab = [0]*npts + [1]*npts\n",
      "# Plots and fits\n",
      "dat_gaus = {  \"x1\": x1, \n",
      "              \"x2\": x2 , \n",
      "           \"label\": lab }\n",
      "df_2d_gaus = pd.DataFrame.from_dict(data = dat_gaus)\n",
      "df_2d_gaus[\"label_fact\"] = pd.Categorical.from_array([str(x) for x in df_2d_gaus.label])\n",
      "# Fit\n",
      "res_2d_gaus = mini(model_cost, x0=[0.1,1.2,0.5], args=(df_2d_gaus, model_2d))\n",
      "# z = 0 defines a line\n",
      "x1 = np.linspace(-4, 4, 27)\n",
      "x2 = -(res_2d_gaus.x[0] + x1*res_2d_gaus.x[1])/res_2d_gaus.x[2]\n",
      "print res_2d_gaus\n",
      "df_2d_gaus_pred = pd.DataFrame.from_dict({\"x1\": x1, \"x2\": x2})\n",
      "df_2d_gaus_pred[\"y\"] = model_2d(res_2d_gaus.x, df_2d_gaus)\n",
      "ggplot(aes(x=\"x1\", y=\"x2\"), data=df_2d_gaus_pred) + \\\n",
      "     geom_line(color=\"red\") + \\\n",
      "     geom_point(aes(x=\"x1\",y=\"x2\", color=\"label_fact\"), data=df_2d_gaus)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def model_quad(model_vec, df):\n",
      "    # quadratic\n",
      "    z = model_vec[0] + model_vec[1]*df.x2 + model_vec[2]*df.x1*df.x1 + model_vec[3]*df.x1\n",
      "    return sigmoid(z)\n",
      "# Make some data\n",
      "npts = 100\n",
      "x1 = np.concatenate((norm.rvs(loc=-3,scale=1, size=npts), \n",
      "                     norm.rvs(loc=0, scale=1, size=npts),\n",
      "                     norm.rvs(loc=0, scale=1, size=npts),\n",
      "                     norm.rvs(loc=3, scale=1, size=npts)))\n",
      "x2 = np.concatenate((norm.rvs(loc=4, scale=6, size=npts), \n",
      "                     norm.rvs(loc=-5, scale=2, size=npts),\n",
      "                     norm.rvs(loc=6, scale=2, size=npts),\n",
      "                     norm.rvs(loc=4, scale=6, size=npts)))\n",
      "lab = [0]*npts + [1]*npts + [0]*npts + [0]*npts\n",
      "# Plots and fits\n",
      "dat_quad = {    \"x1\": x1, \n",
      "                \"x2\": x2, \n",
      "             \"label\": lab }\n",
      "df_quad = pd.DataFrame.from_dict(data = dat_quad)\n",
      "df_quad[\"label_fact\"] = pd.Categorical.from_array([str(x) for x in df_quad.label])\n",
      "# Fit\n",
      "res_quad = mini(model_cost, x0=[0.5, 1.2, -1, 2], args=(df_quad, model_quad))\n",
      "print res_quad\n",
      "# z = 0 defines a line\n",
      "x1 = np.linspace(-3,3,21)\n",
      "x2 = -(res_quad.x[0] + res_quad.x[3]*x1 + res_quad.x[2]*x1*x1)/res_quad.x[1]\n",
      "df_quad_pred = pd.DataFrame.from_dict({\"x1\": x1, \"x2\": x2})\n",
      "df_quad_pred[\"y\"] = model_2d(res_quad.x, df_quad)\n",
      "ggplot(aes(x=\"x1\", y=\"x2\"), data=df_quad_pred) + \\\n",
      "     geom_line(color=\"red\") + \\\n",
      "     geom_point(aes(x=\"x1\",y=\"x2\", color=\"label_fact\"), data=df_quad)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##EndNotes:\n",
      "\n",
      "Followup ideas that are important to success with machine learning:\n",
      "\n",
      "* Split data into training set and test set, hold out test set and measure error in both sets during training to estimate point where generalization breaks down due to over-fitting!  This may mean you don't use a canned optimizer as I did in these demos.\n",
      "* Dimensionlity: As you get more dimensions in your input vector, it is harder to fill the space with data $Volume \\propto d^{N_d}$.  This means addition another dimension requires lots more data!\n",
      "* Regularization helps limit complexity of model. Add a term like $C = \\ldots + \\sum{\\beta_i}^2$ to penalize having many large model parameters $\\beta$ in the model.\n",
      "* Don't use this code, us scikits-learn!\n",
      "* Look at Baysian approach that gives you distributions of separating surfaces.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}